import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
titanic = pd.read_csv('titanic.csv')
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
 2   Pclass       418 non-null    int64
 3   Name         418 non-null    object
 4   Sex          418 non-null    object
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64
 7   Parch        418 non-null    int64
 8   Ticket       418 non-null    object
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object
 11  Embarked     418 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
# Checking null values
titanic.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
# Handling the null values
columns = ['Age', 'Fare']
for col in columns:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    titanic[col] = titanic[col].fillna(titanic[col].median())
titanic['Cabin'] = titanic['Cabin'].fillna('Unknown')
# Checking for duplicate rows
dup = titanic.duplicated().sum()
print("The number of duplicated rows in the dataset is:", dup)
The number of duplicated rows in the dataset is: 0
# Listing the unique values of each text column to spot typos
for col in titanic.select_dtypes(include = "object"):
    print(f"Name of Column: {col}")
    print(titanic[col].unique())
    print('\n', '-'*60, '\n')
Name of Column: Name ['Kelly, Mr. James' 'Wilkes, Mrs. James (Ellen Needs)' 'Myles, Mr. Thomas Francis' 'Wirz, Mr. Albert' 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)' 'Svensson, Mr. Johan Cervin' 'Connolly, Miss. Kate' 'Caldwell, Mr. Albert Francis' 'Abrahim, Mrs. Joseph (Sophie Halaut Easu)' 'Davies, Mr. John Samuel' 'Ilieff, Mr. Ylio' 'Jones, Mr. Charles Cresson' 'Snyder, Mrs. John Pillsbury (Nelle Stevenson)' 'Howard, Mr. Benjamin' 'Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)' 'del Carlo, Mrs. Sebastiano (Argenia Genovesi)' 'Keane, Mr. Daniel' 'Assaf, Mr. Gerios' 'Ilmakangas, Miss. Ida Livija' 'Assaf Khalil, Mrs. Mariana (Miriam")"' 'Rothschild, Mr. Martin' 'Olsen, Master. Artur Karl' 'Flegenheim, Mrs. Alfred (Antoinette)' 'Williams, Mr. Richard Norris II' 'Ryerson, Mrs. Arthur Larned (Emily Maria Borie)' 'Robins, Mr. Alexander A' 'Ostby, Miss. Helene Ragnhild' 'Daher, Mr. Shedid' 'Brady, Mr. John Bertram' 'Samaan, Mr. Elias' 'Louch, Mr. Charles Alexander' 'Jefferys, Mr. Clifford Thomas' 'Dean, Mrs. Bertram (Eva Georgetta Light)' 'Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"' 'Mock, Mr. Philipp Edmund' 'Katavelas, Mr. Vassilios (Catavelas Vassilios")"' 'Roth, Miss. Sarah A' 'Cacic, Miss. Manda' 'Sap, Mr. Julius' 'Hee, Mr. Ling' 'Karun, Mr. Franz' 'Franklin, Mr. Thomas Parham' 'Goldsmith, Mr. Nathan' 'Corbett, Mrs. Walter H (Irene Colvin)' 'Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)' 'Peltomaki, Mr. Nikolai Johannes' 'Chevre, Mr. Paul Romaine' 'Shaughnessy, Mr. Patrick' 'Bucknell, Mrs. William Robert (Emma Eliza Ward)' 'Coutts, Mrs. William (Winnie Minnie" Treanor)"' 'Smith, Mr. Lucien Philip' 'Pulbaum, Mr. Franz' 'Hocking, Miss. Ellen Nellie""' 'Fortune, Miss. Ethel Flora' 'Mangiavacchi, Mr. Serafino Emilio' 'Rice, Master. Albert' 'Cor, Mr. Bartol' 'Abelseth, Mr. Olaus Jorgensen' 'Davison, Mr. Thomas Henry' 'Chaudanson, Miss. Victorine' 'Dika, Mr. Mirko' 'McCrae, Mr. Arthur Gordon' 'Bjorklund, Mr. Ernst Herbert' 'Bradley, Miss. 
Bridget Delia' 'Ryerson, Master. John Borie' 'Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)' 'Burns, Miss. Mary Delia' 'Moore, Mr. Clarence Bloomfield' 'Tucker, Mr. Gilbert Milligan Jr' 'Fortune, Mrs. Mark (Mary McDougald)' 'Mulvihill, Miss. Bertha E' 'Minkoff, Mr. Lazar' 'Nieminen, Miss. Manta Josefina' 'Ovies y Rodriguez, Mr. Servando' 'Geiger, Miss. Amalie' 'Keeping, Mr. Edwin' 'Miles, Mr. Frank' 'Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)' 'Aldworth, Mr. Charles Augustus' 'Doyle, Miss. Elizabeth' 'Boulos, Master. Akar' 'Straus, Mr. Isidor' 'Case, Mr. Howard Brown' 'Demetri, Mr. Marinko' 'Lamb, Mr. John Joseph' 'Khalil, Mr. Betros' 'Barry, Miss. Julia' 'Badman, Miss. Emily Louisa' "O'Donoghue, Ms. Bridget" 'Wells, Master. Ralph Lester' 'Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)' 'Pedersen, Mr. Olaf' 'Davidson, Mrs. Thornton (Orian Hays)' 'Guest, Mr. Robert' 'Birnbaum, Mr. Jakob' 'Tenglin, Mr. Gunnar Isidor' 'Cavendish, Mrs. Tyrell William (Julia Florence Siegel)' 'Makinen, Mr. Kalle Edvard' 'Braf, Miss. Elin Ester Maria' 'Nancarrow, Mr. William Henry' 'Stengel, Mrs. Charles Emil Henry (Annie May Morris)' 'Weisz, Mr. Leopold' 'Foley, Mr. William' 'Johansson Palmquist, Mr. Oskar Leander' 'Thomas, Mrs. Alexander (Thamine Thelma")"' 'Holthen, Mr. Johan Martin' 'Buckley, Mr. Daniel' 'Ryan, Mr. Edward' 'Willer, Mr. Aaron (Abi Weller")"' 'Swane, Mr. George' 'Stanton, Mr. Samuel Ward' 'Shine, Miss. Ellen Natalia' 'Evans, Miss. Edith Corse' 'Buckley, Miss. Katherine' 'Straus, Mrs. Isidor (Rosalie Ida Blun)' 'Chronopoulos, Mr. Demetrios' 'Thomas, Mr. John' 'Sandstrom, Miss. Beatrice Irene' 'Beattie, Mr. Thomson' 'Chapman, Mrs. John Henry (Sara Elizabeth Lawry)' 'Watt, Miss. Bertha J' 'Kiernan, Mr. John' 'Schabert, Mrs. Paul (Emma Mock)' 'Carver, Mr. Alfred John' 'Kennedy, Mr. John' 'Cribb, Miss. Laura Alice' 'Brobeck, Mr. Karl Rudolf' 'McCoy, Miss. Alicia' 'Bowenur, Mr. Solomon' 'Petersen, Mr. Marius' 'Spinner, Mr. 
Henry John' 'Gracie, Col. Archibald IV' 'Lefebre, Mrs. Frank (Frances)' 'Thomas, Mr. Charles P' 'Dintcheff, Mr. Valtcho' 'Carlsson, Mr. Carl Robert' 'Zakarian, Mr. Mapriededer' 'Schmidt, Mr. August' 'Drapkin, Miss. Jennie' 'Goodwin, Mr. Charles Frederick' 'Goodwin, Miss. Jessie Allis' 'Daniels, Miss. Sarah' 'Ryerson, Mr. Arthur Larned' 'Beauchamp, Mr. Henry James' 'Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey")"' 'Vander Planke, Mr. Julius' 'Hilliard, Mr. Herbert Henry' 'Davies, Mr. Evan' 'Crafton, Mr. John Bertram' 'Lahtinen, Rev. William' 'Earnshaw, Mrs. Boulton (Olive Potter)' 'Matinoff, Mr. Nicola' 'Storey, Mr. Thomas' 'Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)' 'Asplund, Master. Filip Oscar' 'Duquemin, Mr. Joseph' 'Bird, Miss. Ellen' 'Lundin, Miss. Olga Elida' 'Borebank, Mr. John James' 'Peacock, Mrs. Benjamin (Edith Nile)' 'Smyth, Miss. Julia' 'Touma, Master. Georges Youssef' 'Wright, Miss. Marion' 'Pearce, Mr. Ernest' 'Peruschitz, Rev. Joseph Maria' 'Kink-Heilmann, Mrs. Anton (Luise Heilmann)' 'Brandeis, Mr. Emil' 'Ford, Mr. Edward Watson' 'Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)' 'Hellstrom, Miss. Hilda Maria' 'Lithman, Mr. Simon' 'Zakarian, Mr. Ortin' 'Dyker, Mr. Adolf Fredrik' 'Torfa, Mr. Assad' 'Asplund, Mr. Carl Oscar Vilhelm Gustafsson' 'Brown, Miss. Edith Eileen' 'Sincock, Miss. Maude' 'Stengel, Mr. Charles Emil Henry' 'Becker, Mrs. Allen Oliver (Nellie E Baumgardner)' 'Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)' 'McCrie, Mr. James Matthew' 'Compton, Mr. Alexander Taylor Jr' 'Marvin, Mrs. Daniel Warner (Mary Graham Carmichael Farquarson)' 'Lane, Mr. Patrick' 'Douglas, Mrs. Frederick Charles (Mary Helene Baxter)' 'Maybery, Mr. Frank Hubert' 'Phillips, Miss. Alice Frances Louisa' 'Davies, Mr. Joseph' 'Sage, Miss. Ada' 'Veal, Mr. James' 'Angle, Mr. William A' 'Salomon, Mr. Abraham L' 'van Billiard, Master. Walter John' 'Lingane, Mr. John' 'Drew, Master. Marshall Brines' 'Karlsson, Mr. 
Julius Konrad Eugen' 'Spedden, Master. Robert Douglas' 'Nilsson, Miss. Berta Olivia' 'Baimbrigge, Mr. Charles Robert' 'Rasmussen, Mrs. (Lena Jacobsen Solvang)' 'Murphy, Miss. Nora' 'Danbom, Master. Gilbert Sigvard Emanuel' 'Astor, Col. John Jacob' 'Quick, Miss. Winifred Vera' 'Andrew, Mr. Frank Thomas' 'Omont, Mr. Alfred Fernand' 'McGowan, Miss. Katherine' 'Collett, Mr. Sidney C Stuart' 'Rosenbaum, Miss. Edith Louise' 'Delalic, Mr. Redjo' 'Andersen, Mr. Albert Karvin' 'Finoli, Mr. Luigi' 'Deacon, Mr. Percy William' 'Howard, Mrs. Benjamin (Ellen Truelove Arman)' 'Andersson, Miss. Ida Augusta Margareta' 'Head, Mr. Christopher' 'Mahon, Miss. Bridget Delia' 'Wick, Mr. George Dennick' 'Widener, Mrs. George Dunton (Eleanor Elkins)' 'Thomson, Mr. Alexander Morrison' 'Duran y More, Miss. Florentina' 'Reynolds, Mr. Harold J' 'Cook, Mrs. (Selena Rogers)' 'Karlsson, Mr. Einar Gervasius' 'Candee, Mrs. Edward (Helen Churchill Hungerford)' 'Moubarek, Mrs. George (Omine Amenia" Alexander)"' 'Asplund, Mr. Johan Charles' 'McNeill, Miss. Bridget' 'Everett, Mr. Thomas James' 'Hocking, Mr. Samuel James Metcalfe' 'Sweet, Mr. George Frederick' 'Willard, Miss. Constance' 'Wiklund, Mr. Karl Johan' 'Linehan, Mr. Michael' 'Cumings, Mr. John Bradley' 'Vendel, Mr. Olof Edvin' 'Warren, Mr. Frank Manley' 'Baccos, Mr. Raffull' 'Hiltunen, Miss. Marta' 'Douglas, Mrs. Walter Donald (Mahala Dutton)' 'Lindstrom, Mrs. Carl Johan (Sigrid Posse)' 'Christy, Mrs. (Alice Frances)' 'Spedden, Mr. Frederic Oakley' 'Hyman, Mr. Abraham' 'Johnston, Master. William Arthur Willie""' 'Kenyon, Mr. Frederick R' 'Karnes, Mrs. J Frank (Claire Bennett)' 'Drew, Mr. James Vivian' 'Hold, Mrs. Stephen (Annie Margaret Hill)' 'Khalil, Mrs. Betros (Zahie Maria" Elias)"' 'West, Miss. Barbara J' 'Abrahamsson, Mr. Abraham August Johannes' 'Clark, Mr. Walter Miller' 'Salander, Mr. Karl Johan' 'Wenzel, Mr. Linhart' 'MacKay, Mr. George William' 'Mahon, Mr. John' 'Niklasson, Mr. Samuel' 'Bentham, Miss. Lilian W' 'Midtsjo, Mr. 
Karl Albert' 'de Messemaeker, Mr. Guillaume Joseph' 'Nilsson, Mr. August Ferdinand' 'Wells, Mrs. Arthur Henry (Addie" Dart Trevaskis)"' 'Klasen, Miss. Gertrud Emilia' 'Portaluppi, Mr. Emilio Ilario Giuseppe' 'Lyntakoff, Mr. Stanko' 'Chisholm, Mr. Roderick Robert Crispin' 'Warren, Mr. Charles William' 'Howard, Miss. May Elizabeth' 'Pokrnic, Mr. Mate' 'McCaffry, Mr. Thomas Francis' 'Fox, Mr. Patrick' 'Clark, Mrs. Walter Miller (Virginia McDowell)' 'Lennon, Miss. Mary' 'Saade, Mr. Jean Nassr' 'Bryhl, Miss. Dagmar Jenny Ingeborg ' 'Parker, Mr. Clifford Richard' 'Faunthorpe, Mr. Harry' 'Ware, Mr. John James' 'Oxenham, Mr. Percy Thomas' 'Oreskovic, Miss. Jelka' 'Peacock, Master. Alfred Edward' 'Fleming, Miss. Honora' 'Touma, Miss. Maria Youssef' 'Rosblom, Miss. Salli Helena' 'Dennis, Mr. William' 'Franklin, Mr. Charles (Charles Fardon)' 'Snyder, Mr. John Pillsbury' 'Mardirosian, Mr. Sarkis' 'Ford, Mr. Arthur' 'Rheims, Mr. George Alexander Lucien' 'Daly, Miss. Margaret Marcella Maggie""' 'Nasr, Mr. Mustafa' 'Dodge, Dr. Washington' 'Wittevrongel, Mr. Camille' 'Angheloff, Mr. Minko' 'Laroche, Miss. Louise' 'Samaan, Mr. Hanna' 'Loring, Mr. Joseph Holland' 'Johansson, Mr. Nils' 'Olsson, Mr. Oscar Wilhelm' 'Malachard, Mr. Noel' 'Phillips, Mr. Escott Robert' 'Pokrnic, Mr. Tome' 'McCarthy, Miss. Catherine Katie""' 'Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)' 'Allison, Mr. Hudson Joshua Creighton' 'Aks, Master. Philip Frank' 'Hays, Mr. Charles Melville' 'Hansen, Mrs. Claus Peter (Jennie L Howard)' 'Cacic, Mr. Jego Grga' 'Vartanian, Mr. David' 'Sadowitz, Mr. Harry' 'Carr, Miss. Jeannie' 'White, Mrs. John Stuart (Ella Holmes)' 'Hagardon, Miss. Kate' 'Spencer, Mr. William Augustus' 'Rogers, Mr. Reginald Harry' 'Jonsson, Mr. Nils Hilding' 'Jefferys, Mr. Ernest Wilfred' 'Andersson, Mr. Johan Samuel' 'Krekorian, Mr. Neshan' 'Nesson, Mr. Israel' 'Rowe, Mr. Alfred G' 'Kreuchen, Miss. Emilie' 'Assam, Mr. Ali' 'Becker, Miss. Ruth Elizabeth' 'Rosenshine, Mr. 
George (Mr George Thorne")"' 'Clarke, Mr. Charles Valentine' 'Enander, Mr. Ingvar' 'Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) ' 'Dulles, Mr. William Crothers' 'Thomas, Mr. Tannous' 'Nakid, Mrs. Said (Waika Mary" Mowad)"' 'Cor, Mr. Ivan' 'Maguire, Mr. John Edward' 'de Brito, Mr. Jose Joaquim' 'Elias, Mr. Joseph' 'Denbury, Mr. Herbert' 'Betros, Master. Seman' 'Fillbrook, Mr. Joseph Charles' 'Lundstrom, Mr. Thure Edvin' 'Sage, Mr. John George' 'Cardeza, Mrs. James Warburton Martinez (Charlotte Wardle Drake)' 'van Billiard, Master. James William' 'Abelseth, Miss. Karen Marie' 'Botsford, Mr. William Hull' 'Whabee, Mrs. George Joseph (Shawneene Abi-Saab)' 'Giles, Mr. Ralph' 'Walcroft, Miss. Nellie' 'Greenfield, Mrs. Leo David (Blanche Strouse)' 'Stokes, Mr. Philip Joseph' 'Dibden, Mr. William' 'Herman, Mr. Samuel' 'Dean, Miss. Elizabeth Gladys Millvina""' 'Julian, Mr. Henry Forbes' 'Brown, Mrs. John Murray (Caroline Lane Lamson)' 'Lockyer, Mr. Edward' "O'Keefe, Mr. Patrick" 'Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)' 'Sage, Master. William Henry' 'Mallet, Mrs. Albert (Antoinette Magnin)' 'Ware, Mrs. John James (Florence Louise Long)' 'Strilic, Mr. Ivan' 'Harder, Mrs. George Achilles (Dorothy Annan)' 'Sage, Mrs. John (Annie Bullen)' 'Caram, Mr. Joseph' 'Riihivouri, Miss. Susanna Juhantytar Sanni""' 'Gibson, Mrs. Leonard (Pauline C Boeson)' 'Pallas y Castello, Mr. Emilio' 'Giles, Mr. Edgar' 'Wilson, Miss. Helen Alice' 'Ismay, Mr. Joseph Bruce' 'Harbeck, Mr. William H' 'Dodge, Mrs. Washington (Ruth Vidaver)' 'Bowen, Miss. Grace Scott' 'Kink, Miss. Maria' 'Cotterill, Mr. Henry Harry""' 'Hipkins, Mr. William Edward' 'Asplund, Master. Carl Edgar' "O'Connor, Mr. Patrick" 'Foley, Mr. Joseph' 'Risien, Mrs. Samuel (Emma)' "McNamee, Mrs. Neal (Eileen O'Leary)" 'Wheeler, Mr. Edwin Frederick""' 'Herman, Miss. Kate' 'Aronsson, Mr. Ernst Axel Algot' 'Ashby, Mr. John' 'Canavan, Mr. Patrick' 'Palsson, Master. Paul Folke' 'Payne, Mr. Vivian Ponsonby' 'Lines, Mrs. 
Ernest H (Elizabeth Lindsey James)' 'Abbott, Master. Eugene Joseph' 'Gilbert, Mr. William' 'Kink-Heilmann, Mr. Anton' 'Smith, Mrs. Lucien Philip (Mary Eloise Hughes)' 'Colbert, Mr. Patrick' 'Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)' 'Larsson-Rondberg, Mr. Edvard A' 'Conlon, Mr. Thomas Henry' 'Bonnell, Miss. Caroline' 'Gale, Mr. Harry' 'Gibson, Miss. Dorothy Winifred' 'Carrau, Mr. Jose Pedro' 'Frauenthal, Mr. Isaac Gerald' 'Nourney, Mr. Alfred (Baron von Drachstedt")"' 'Ware, Mr. William Jeffery' 'Widener, Mr. George Dunton' 'Riordan, Miss. Johanna Hannah""' 'Peacock, Miss. Treasteall' 'Naughton, Miss. Hannah' 'Minahan, Mrs. William Edward (Lillian E Thorpe)' 'Henriksson, Miss. Jenny Lovisa' 'Spector, Mr. Woolf' 'Oliva y Ocana, Dona. Fermina' 'Saether, Mr. Simon Sivertsen' 'Ware, Mr. Frederick' 'Peter, Master. Michael J'] ------------------------------------------------------------ Name of Column: Sex ['male' 'female'] ------------------------------------------------------------ Name of Column: Ticket ['330911' '363272' '240276' '315154' '3101298' '7538' '330972' '248738' '2657' 'A/4 48871' '349220' '694' '21228' '24065' 'W.E.P. 5734' 'SC/PARIS 2167' '233734' '2692' 'STON/O2. 3101270' '2696' 'PC 17603' 'C 17368' 'PC 17598' 'PC 17597' 'PC 17608' 'A/5. 3337' '113509' '2698' '113054' '2662' 'SC/AH 3085' 'C.A. 31029' 'C.A. 2315' 'W./C. 6607' '13236' '2682' '342712' '315087' '345768' '1601' '349256' '113778' 'SOTON/O.Q. 3101263' '237249' '11753' 'STON/O 2. 3101291' 'PC 17594' '370374' '11813' 'C.A. 37671' '13695' 'SC/PARIS 2168' '29105' '19950' 'SC/A.3 2861' '382652' '349230' '348122' '386525' '349232' '237216' '347090' '334914' 'F.C.C. 13534' '330963' '113796' '2543' '382653' '349211' '3101297' 'PC 17562' '113503' '359306' '11770' '248744' '368702' '2678' 'PC 17483' '19924' '349238' '240261' '2660' '330844' 'A/4 31416' '364856' '29103' '347072' '345498' 'F.C. 12750' '376563' '13905' '350033' '19877' 'STON/O 2. 3101268' '347471' 'A./5. 
3338' '11778' '228414' '365235' '347070' '2625' 'C 4001' '330920' '383162' '3410' '248734' '237734' '330968' 'PC 17531' '329944' '2680' '2681' 'PP 9549' '13050' 'SC/AH 29037' 'C.A. 33595' '367227' '392095' '368783' '371362' '350045' '367226' '211535' '342441' 'STON/OQ. 369943' '113780' '4133' '2621' '349226' '350409' '2656' '248659' 'SOTON/OQ 392083' 'CA 2144' '113781' '244358' '17475' '345763' '17463' 'SC/A4 23568' '113791' '250651' '11767' '349255' '3701' '350405' '347077' 'S.O./P.P. 752' '347469' '110489' 'SOTON/O.Q. 3101315' '335432' '2650' '220844' '343271' '237393' '315153' 'PC 17591' 'W./C. 6608' '17770' '7548' 'S.O./P.P. 251' '2670' '2673' '29750' 'C.A. 33112' '230136' 'PC 17756' '233478' '113773' '7935' 'PC 17558' '239059' 'S.O./P.P. 2' 'A/4 48873' 'CA. 2343' '28221' '226875' '111163' 'A/5. 851' '235509' '28220' '347465' '16966' '347066' 'C.A. 31030' '65305' '36568' '347080' 'PC 17757' '26360' 'C.A. 34050' 'F.C. 12998' '9232' '28034' 'PC 17613' '349250' 'SOTON/O.Q. 3101308' 'S.O.C. 14879' '347091' '113038' '330924' '36928' '32302' 'SC/PARIS 2148' '342684' 'W./C. 14266' '350053' 'PC 17606' '2661' '350054' '370368' 'C.A. 6212' '242963' '220845' '113795' '3101266' '330971' 'PC 17599' '350416' '110813' '2679' '250650' 'PC 17761' '112377' '237789' '3470' '17464' '26707' 'C.A. 34651' 'SOTON/O2 3101284' '13508' '7266' '345775' 'C.A. 42795' 'AQ/4 3130' '363611' '28404' '345501' '345572' '350410' 'C.A. 34644' '349235' '112051' 'C.A. 49867' 'A. 2. 39186' '315095' '368573' '370371' '2676' '236853' 'SC 14888' '2926' 'CA 31352' 'W./C. 14260' '315085' '364859' '370129' 'A/5 21175' 'SOTON/O.Q. 3101314' '2655' 'A/5 1478' 'PC 17607' '382650' '2652' '33638' '345771' '349202' 'SC/Paris 2123' '113801' '347467' '347079' '237735' '315092' '383123' '112901' '392091' '12749' '350026' '315091' '2658' 'LP 1588' '368364' 'PC 17760' 'AQ/3. 30631' 'PC 17569' '28004' '350408' '347075' '2654' '244368' '113790' '24160' 'SOTON/O.Q. 
3101309' 'PC 17585' '2003' '236854' 'PC 17580' '2684' '2653' '349229' '110469' '244360' '2675' '2622' 'C.A. 15185' '350403' 'PC 17755' '348125' '237670' '2688' '248726' 'F.C.C. 13528' 'PC 17759' 'F.C.C. 13540' '113044' '11769' '1222' '368402' '349910' 'S.C./PARIS 2079' '315083' '11765' '2689' '3101295' '112378' 'SC/PARIS 2147' '28133' '112058' '248746' '315152' '29107' '680' '366713' '330910' '364498' '376566' 'SC/PARIS 2159' '349911' '244346' '364858' '349909' 'PC 17592' 'C.A. 2673' 'C.A. 30769' '371109' '13567' '347065' '21332' '28664' '113059' '17765' 'SC/PARIS 2166' '28666' '334915' '365237' '19928' '347086' 'A.5. 3236' 'PC 17758' 'SOTON/O.Q. 3101262' '359309' '2668'] ------------------------------------------------------------ Name of Column: Cabin ['Unknown' 'B45' 'E31' 'B57 B59 B63 B66' 'B36' 'A21' 'C78' 'D34' 'D19' 'A9' 'D15' 'C31' 'C23 C25 C27' 'F G63' 'B61' 'C53' 'D43' 'C130' 'C132' 'C101' 'C55 C57' 'B71' 'C46' 'C116' 'F' 'A29' 'G6' 'C6' 'C28' 'C51' 'E46' 'C54' 'C97' 'D22' 'B10' 'F4' 'E45' 'E52' 'D30' 'B58 B60' 'E34' 'C62 C64' 'A11' 'B11' 'C80' 'F33' 'C85' 'D37' 'C86' 'D21' 'C89' 'F E46' 'A34' 'D' 'B26' 'C22 C26' 'B69' 'C32' 'B78' 'F E57' 'F2' 'A18' 'C106' 'B51 B53 B55' 'D10 D12' 'E60' 'E50' 'E39 E41' 'B52 B54 B56' 'C39' 'B24' 'D28' 'B41' 'C7' 'D40' 'D38' 'C105'] ------------------------------------------------------------ Name of Column: Embarked ['Q' 'S' 'C'] ------------------------------------------------------------
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | Unknown | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | Unknown | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | Unknown | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | Unknown | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | Unknown | S |
# Creating a new feature of title from name column based on the pattern found above
titanic['Title'] = titanic['Name'].str.extract(r',\s(.*?)\.')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Dona', 'Mrs')
titanic['Title'] = titanic['Title'].replace(['Col', 'Rev', 'Dr'], 'Rare')
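To see what the pattern above captures, here is a minimal sketch on a handful of names taken from the output earlier in the notebook. The `names` and `titles` variables are illustrative only, and the dictionary form of `replace` is just a compact alternative to the three separate calls.

```python
import pandas as pd

# A few names in the same "Surname, Title. Given names" layout as the Name column.
names = pd.Series([
    "Kelly, Mr. James",
    "O'Donoghue, Ms. Bridget",
    "Oliva y Ocana, Dona. Fermina",
    "Gracie, Col. Archibald IV",
    "Olsen, Master. Artur Karl",
])

# The regex grabs the text between ", " and the first "." that follows it.
# expand=False returns a Series rather than a one-column DataFrame.
titles = names.str.extract(r",\s(.*?)\.", expand=False)

# Same consolidation as above, written as a single mapping.
titles = titles.replace(
    {"Ms": "Miss", "Dona": "Mrs", "Col": "Rare", "Rev": "Rare", "Dr": "Rare"}
)
print(titles.tolist())
```

The non-greedy `(.*?)` matters here: a greedy match would run to the last period in names that contain more than one.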
# Creating another feature of Age group by making bins
bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", 'Elderly']
titanic['Age_Group'] = pd.cut(titanic['Age'], bins = bins, labels = labels)
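Because `pd.cut` uses right-inclusive intervals by default, an age that sits exactly on an edge falls into the lower group. The sketch below reuses the same `bins` and `labels` on a few hand-picked ages (the `ages` values are illustrative, chosen to sit on or next to the edges).

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", "Elderly"]

# Ages chosen to sit on or next to the bin edges.
ages = pd.Series([0.17, 17, 17.5, 32, 34.5, 45, 47, 50, 62])

# Default right=True means each interval is (left, right], e.g. (17, 32] is "Young".
groups = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age_Group": groups}))
```

So a passenger aged exactly 17 is labelled "Children" and one aged exactly 50 is labelled "Senior-Adult"; passing `right=False` would shift both into the next group.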
# Generating another new feature: family size
titanic['Family'] = titanic['SibSp'] + titanic['Parch']
# Dropping non-essential columns
titanic.drop(['PassengerId', 'Name', 'Ticket'], axis = 1, inplace = True)
titanic.head()
| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Title | Age_Group | Family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Unknown | Q | Mr | Mid-Aged | 0 |
| 1 | 1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | Unknown | S | Mrs | Senior-Adult | 1 |
| 2 | 0 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Unknown | Q | Mr | Elderly | 0 |
| 3 | 0 | 3 | male | 27.0 | 0 | 0 | 8.6625 | Unknown | S | Mr | Young | 0 |
| 4 | 1 | 3 | female | 22.0 | 1 | 1 | 12.2875 | Unknown | S | Mrs | Young | 2 |
# Changing the position of the new columns to place them right after their parent columns
col_to_move = titanic.pop('Age_Group')
titanic.insert(4, 'Age_Group', col_to_move)
col_to_move = titanic.pop('Family')
titanic.insert(7, 'Family', col_to_move)
titanic['Age_Group'] = titanic['Age_Group'].astype('object')
titanic.describe()
| | Survived | Pclass | Age | SibSp | Parch | Family | Fare |
|---|---|---|---|---|---|---|---|
| count | 418.000000 | 418.000000 | 418.000000 | 418.000000 | 418.000000 | 418.000000 | 418.000000 |
| mean | 0.363636 | 2.265550 | 29.599282 | 0.447368 | 0.392344 | 0.839713 | 35.576535 |
| std | 0.481622 | 0.841838 | 12.703770 | 0.896760 | 0.981429 | 1.519072 | 55.850103 |
| min | 0.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 23.000000 | 0.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 0.000000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 35.750000 | 1.000000 | 0.000000 | 1.000000 | 31.471875 |
| max | 1.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 10.000000 | 512.329200 |
titanic.describe(include = 'O')
| | Sex | Age_Group | Cabin | Embarked | Title |
|---|---|---|---|---|---|
| count | 418 | 418 | 418 | 418 | 418 |
| unique | 2 | 5 | 77 | 3 | 5 |
| top | male | Young | Unknown | S | Mr |
| freq | 266 | 257 | 327 | 270 | 240 |
titanic.groupby('Sex')[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Family', 'Fare']].mean()
| Sex | Survived | Pclass | Age | SibSp | Parch | Family | Fare |
|---|---|---|---|---|---|---|---|
| female | 1.0 | 2.144737 | 29.734145 | 0.565789 | 0.598684 | 1.164474 | 49.747699 |
| male | 0.0 | 2.334586 | 29.522218 | 0.379699 | 0.274436 | 0.654135 | 27.478728 |
titanic.groupby('Embarked')[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Family', 'Fare']].mean()
| Embarked | Survived | Pclass | Age | SibSp | Parch | Family | Fare |
|---|---|---|---|---|---|---|---|
| C | 0.392157 | 1.794118 | 33.220588 | 0.421569 | 0.382353 | 0.803922 | 66.259765 |
| Q | 0.521739 | 2.869565 | 28.108696 | 0.195652 | 0.021739 | 0.217391 | 10.957700 |
| S | 0.325926 | 2.340741 | 28.485185 | 0.500000 | 0.459259 | 0.959259 | 28.179413 |
survived_counts = titanic['Survived'].value_counts()
fig_surv_perc = px.pie(titanic, names= survived_counts.index, values = survived_counts.values, title=f'Distribution of Survived', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_surv_perc.update_traces(textinfo='percent+label')
fig_surv_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_surv_perc.show()
pclass_counts = titanic.Pclass.value_counts()
fig_pclass_perc = px.pie(titanic, names= pclass_counts.index, values = pclass_counts.values, title=f'Distribution of Pclass', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_pclass_perc.update_traces(textinfo='percent+label')
fig_pclass_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_pclass_perc.show()
fig_sex_count = px.histogram(titanic, x = 'Sex', color = 'Sex', color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_count.update_layout(title_text='Count of different Sex', xaxis_title='Sex', yaxis_title='Count', plot_bgcolor = 'white')
fig_sex_count.show()
fig_sex_perc = px.pie(titanic, names= 'Sex', title=f'Distribution of Sex', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_perc.update_traces(textinfo='percent+label')
fig_sex_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_sex_perc.show()
fig_age = px.histogram(titanic, x='Age', nbins=30, histnorm='probability density')
fig_age.update_traces(marker=dict(color='#440154'), selector=dict(type='histogram'))
fig_age.update_layout(title='Distribution of Age', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Age', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_age.show()
fig_fare = px.histogram(titanic, x='Fare', nbins=30, histnorm='probability density')
fig_fare.update_traces(marker=dict(color='#440154'), selector=dict(type='histogram'))
fig_fare.update_layout(title='Distribution of Fare', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Fare', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_fare.show()
fig_embarked_count = px.histogram(titanic, x = 'Embarked', color = 'Embarked', color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_count.update_layout(title_text='Count of different Embarked', xaxis_title='Embarked', yaxis_title='Count', plot_bgcolor = 'white')
fig_embarked_count.show()
fig_embarked_perc = px.pie(titanic, names= 'Embarked', title=f'Distribution of Embarked', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_perc.update_traces(textinfo='percent+label')
fig_embarked_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_embarked_perc.show()
fig_title_count = px.histogram(titanic, x = 'Title', color = 'Title', color_discrete_sequence=px.colors.sequential.Viridis)
fig_title_count.update_layout(title_text='Count of different Title', xaxis_title='Title', yaxis_title='Count', plot_bgcolor = 'white')
fig_title_count.show()
fig_title_perc = px.pie(titanic, names= 'Title', title=f'Distribution of Title', hole=0.2, color_discrete_sequence=px.colors.sequential.Viridis)
fig_title_perc.update_traces(textinfo='percent+label')
fig_title_perc.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig_title_perc.show()
Only 36.4% of the passengers in this dataset survived the sinking.
At least half of the passengers travelled in third class (Pclass = 3), and males outnumber females (266 to 152).
Age is concentrated around 25-29, partly because the 86 missing ages were filled with the median of 27; the middle half of fares lies between roughly 8 and 31.
Most passengers embarked at Southampton, and the most common title is Mr (240 of 418), the title used for adult men.
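The percentages quoted above can also be read off numerically rather than from the pie charts. A minimal sketch, assuming only the class counts reported for `Survived` (266 and 152); the `survived` Series is rebuilt from those counts for illustration and is not the notebook's DataFrame.

```python
import pandas as pd

# Rebuild the Survived column from its reported counts: 266 zeros and 152 ones.
survived = pd.Series([0] * 266 + [1] * 152, name="Survived")

# normalize=True returns proportions instead of raw counts.
shares = survived.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data the same call would be `titanic['Survived'].value_counts(normalize=True)`, and it works the same way for `Pclass`, `Sex`, `Embarked` and `Title`.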
fig_pclass_surv = px.histogram(titanic, x = 'Pclass', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_pclass_surv.update_layout(title = 'Survival according to passenger classes', plot_bgcolor = 'white')
fig_pclass_surv.show()
fig_sex_surv = px.histogram(titanic, x = 'Sex', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_sex_surv.update_layout(title = 'Survival according to gender', plot_bgcolor = 'white')
fig_sex_surv.show()
fig_age_group_surv = px.histogram(titanic, x = 'Age_Group', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_age_group_surv.update_layout(title = 'Survival according to age groups', plot_bgcolor = 'white')
fig_age_group_surv.show()
fig_family_surv = px.histogram(titanic, x = 'Family', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_family_surv.update_layout(title = 'Survival according to number of family members', plot_bgcolor = 'white')
fig_family_surv.show()
fig_embarked_surv = px.histogram(titanic, x = 'Embarked', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig_embarked_surv.update_layout(title = 'Survival according to embarked', plot_bgcolor = 'white')
fig_embarked_surv.show()
Pclass = 1 has the fewest deaths and Pclass = 3 the most.
No male passengers survived and all female passengers survived, so in this file Survived is fully determined by Sex.
The Young age group has the highest death count, which is partly a matter of size, since it is by far the largest group (257 of 418 passengers); the Elderly group has a comparatively good survival count.
In the family-size chart, passengers with few family members appear more likely to survive, although survivors (all female) average a larger family size (1.16) than non-survivors (0.65).
A high share of the passengers who embarked at Queenstown survived (about 52%), and Southampton has the highest number of deaths.
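The observation about gender is strong enough to deserve a direct check, because a feature that fully determines the target makes any later model score trivial. A minimal sketch with `pd.crosstab`, rebuilt from the counts shown earlier (266 males, 152 females, survival means of 0.0 and 1.0); the `demo` frame and the `determined` flag are illustrative names, not part of the notebook.

```python
import pandas as pd

# Stand-in frame that reproduces the reported pattern:
# 266 males with Survived = 0 and 152 females with Survived = 1.
demo = pd.DataFrame({
    "Sex": ["male"] * 266 + ["female"] * 152,
    "Survived": [0] * 266 + [1] * 152,
})

table = pd.crosstab(demo["Sex"], demo["Survived"])
print(table)

# A feature determines the target when every row of the table
# has exactly one non-zero cell.
determined = bool(((table > 0).sum(axis=1) == 1).all())
print("Survived fully determined by Sex:", determined)
```

Running `pd.crosstab(titanic['Sex'], titanic['Survived'])` on the real data before the encoding step would show the same two empty cells.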
grouped_data = titanic.groupby(['Age', 'Sex', 'Survived']).agg({'Fare': 'mean'}).reset_index()
fig = px.line(grouped_data, x='Age', y='Fare', color='Survived', facet_col='Sex', facet_col_wrap=2, labels={'Fare': 'Fare', 'Survived': 'Survived'}, title='Relation of age and gender with fare')
fig.update_layout(hovermode='x unified', plot_bgcolor = 'white')
fig.update_xaxes(title_text='Age')
fig.update_yaxes(title_text='Fare', row=1, col=1)
fig.show()
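# Encoding the categorical variables as integers
# (LabelEncoder assigns codes alphabetically; it does not encode any real ordering)
le = LabelEncoder()
cols = ['Sex', 'Age_Group', 'Cabin', 'Embarked', 'Title']
for col in cols:
    titanic[col] = le.fit_transform(titanic[col])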
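Integer codes from `LabelEncoder` imply an order (for example C < Q < S for `Embarked`) that linear models such as logistic regression will treat as meaningful. One common alternative, shown here only as a sketch and not as what this notebook does, is one-hot encoding with `pd.get_dummies`. The `demo` frame is a made-up five-row sample.

```python
import pandas as pd

# Made-up sample with two nominal columns.
demo = pd.DataFrame({
    "Embarked": ["Q", "S", "Q", "S", "C"],
    "Sex": ["male", "female", "male", "male", "female"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(demo, columns=["Embarked", "Sex"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

For a high-cardinality column such as `Cabin` (77 distinct values here), one-hot encoding would add many sparse columns, so reducing it first (for example to the deck letter) may be a better fit.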
# Checking the class count for target variable
titanic.Survived.value_counts()
Survived
0    266
1    152
Name: count, dtype: int64
X = titanic.drop('Survived', axis = 1)
y = titanic['Survived']
# Using the SMOTE technique to handle class imbalance
smote = SMOTE(random_state = 42)
X_balanced, y_balanced = smote.fit_resample(X, y)
# Splitting the dataset into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size = 0.3, random_state = 42)
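In the cells above, SMOTE runs before the split, so synthetic rows interpolated from passengers who later land in the test set can end up in the training set, and the test set itself contains synthetic rows. A leakage-free ordering is to split first and handle imbalance using the training part only. The sketch below illustrates that ordering with scikit-learn alone: it swaps SMOTE for `class_weight="balanced"` (a different technique, used here only to avoid extra dependencies) and uses synthetic data from `make_classification` as a stand-in for the Titanic features. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the Titanic features (roughly 64% / 36%).
X_demo, y_demo = make_classification(
    n_samples=418, n_features=8, weights=[0.64, 0.36], random_state=42
)

# 1. Split first, stratified, so the test set holds only real, untouched rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# 2. Keep scaling inside the pipeline so each CV fold fits its own scaler.
#    class_weight="balanced" reweights classes instead of resampling them.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
model.fit(X_tr, y_tr)
print(f"CV accuracy: {cv_scores.mean():.3f}")
print(f"Held-out accuracy: {model.score(X_te, y_te):.3f}")
```

To keep SMOTE itself, the same ordering applies: call `fit_resample` on the training split only, or place the sampler in an imbalanced-learn pipeline so it runs only when fitting.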
# Doing feature scaling by StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
# Building the models
lr = LogisticRegression()
rf = RandomForestClassifier()
gbc = GradientBoostingClassifier()
lr.fit(X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)
gbc.fit(X_train_scaled, y_train)
lr_pred = lr.predict(X_test_scaled)
rf_pred = rf.predict(X_test_scaled)
gbc_pred = gbc.predict(X_test_scaled)
# Evaluating the models by generating classification report and cross validation scores
lr_report = classification_report(y_test, lr_pred)
lr_scores = cross_val_score(lr, X_train_scaled, y_train, cv=5, scoring='accuracy')
rf_report = classification_report(y_test, rf_pred)
rf_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='accuracy')
gbc_report = classification_report(y_test, gbc_pred)
gbc_scores = cross_val_score(gbc, X_train_scaled, y_train, cv=5, scoring='accuracy')
print('The classification report of Logistic Regression is below : ', '\n\n\n', lr_report)
print(f"Logistic Regression Cross-Validation Scores: {lr_scores}")
print('\n', '='*100, '\n')
print('The classification report of Random Forest is below : ', '\n\n\n', rf_report)
print(f"Random Forest Cross-Validation Scores: {rf_scores}")
print('\n', '='*100, '\n')
# Print gbc_report here (the original printed rf_report a second time)
print('The classification report of Gradient Boosting Classifier is below : ', '\n\n\n', gbc_report)
print(f"Gradient Boosting Classifier Cross-Validation Scores: {gbc_scores}")
The classification report of Logistic Regression is below :
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82
    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160
Logistic Regression Cross-Validation Scores: [1. 1. 1. 1. 1.]
====================================================================================================
The classification report of Random Forest is below :
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82
    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160
Random Forest Cross-Validation Scores: [1. 1. 1. 1. 1.]
====================================================================================================
The classification report of Gradient Boosting Classifier is below :
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82
    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160
Gradient Boosting Classifier Cross-Validation Scores: [1. 1. 1. 1. 1.]
In this Titanic Survival Prediction analysis, we explored various aspects of the dataset to understand the factors influencing survival. We found that only 36.4% of the passengers survived the sinking, with clear differences in survival among passenger classes, genders, and age groups. The dataset also suggested that features such as Fare and embarkation port are related to survival. We trained three classification models to predict survival, and all of them reached perfect scores. This is not a sign of strong models: in this file Survived coincides exactly with Sex (every female survived and no male did), so each model only has to learn a single rule, and the scores say little about how well they would generalize.
Insights:
Our analysis surfaced several insights into the Titanic dataset. We handled missing values by filling null entries in the Age and Fare columns with their medians, which are robust to outliers, and by filling the Cabin column with "Unknown". New features, Title, Age_Group, and Family, were created to better describe passenger demographics. Young males travelling from Southampton made up the largest group, and females were more likely to travel with family and paid higher fares on average. Passengers from Cherbourg had an average age of 33 and paid around 66 pounds in fares. Pclass 3 had the highest number of deaths. No males survived and all females survived. Family size appeared to be related to survival, and passengers from Queenstown had a higher survival rate than those from Southampton.
What's next?
For future analysis, it would be beneficial to explore more advanced machine learning techniques and further feature engineering. Investigating variables that were only lightly used here, such as cabin location (Cabin was label-encoded as a whole string), and passenger demographics beyond age, gender, and family size could provide deeper insights. Further exploration of the dataset and refinement of the models could improve our ability to predict Titanic passenger survival.